---
title: "Analysis of D1 Basketball"
output:
flexdashboard::flex_dashboard:
theme:
version: 4
bootswatch: minty
orientation: columns
vertical_layout: fill
source_code: embed
---
```{r setup}
library(flexdashboard)
library(tidyverse)
library(ggrepel)
library(DT)
library(plotly)
### All data sets from 2013-2024 season
bball13 <- read.csv("C:/Users/middi/OneDrive/Documents/MTH 209/Final/Data/cbb13.csv") %>%
select(TEAM, CONF, G, W, ADJOE, ADJDE, BARTHAG, EFG_D, TORD, X3P_O)
bball14 <- read.csv("C:/Users/middi/OneDrive/Documents/MTH 209/Final/Data/cbb14.csv") %>%
select(TEAM, CONF, G, W, ADJOE, ADJDE, BARTHAG, EFG_D, TORD, X3P_O)
bball15 <- read.csv("C:/Users/middi/OneDrive/Documents/MTH 209/Final/Data/cbb15.csv") %>%
select(TEAM, CONF, G, W, ADJOE, ADJDE, BARTHAG, EFG_D, TORD, X3P_O)
bball16 <- read.csv("C:/Users/middi/OneDrive/Documents/MTH 209/Final/Data/cbb16.csv") %>%
select(TEAM, CONF, G, W, ADJOE, ADJDE, BARTHAG, EFG_D, TORD, X3P_O)
bball17 <- read.csv("C:/Users/middi/OneDrive/Documents/MTH 209/Final/Data/cbb17.csv") %>%
select(TEAM, CONF, G, W, ADJOE, ADJDE, BARTHAG, EFG_D, TORD, X3P_O)
bball18 <- read.csv("C:/Users/middi/OneDrive/Documents/MTH 209/Final/Data/cbb18.csv") %>%
select(TEAM, CONF, G, W, ADJOE, ADJDE, BARTHAG, EFG_D, TORD, X3P_O)
bball19 <- read.csv("C:/Users/middi/OneDrive/Documents/MTH 209/Final/Data/cbb19.csv") %>%
select(TEAM, CONF, G, W, ADJOE, ADJDE, BARTHAG, EFG_D, TORD, X3P_O)
bball20 <- read.csv("C:/Users/middi/OneDrive/Documents/MTH 209/Final/Data/cbb20.csv") %>%
select(TEAM, CONF, G, W, ADJOE, ADJDE, BARTHAG, EFG_D, TORD, X3P_O)
bball21 <- read.csv("C:/Users/middi/OneDrive/Documents/MTH 209/Final/Data/cbb21.csv") %>%
select(TEAM, CONF, G, W, ADJOE, ADJDE, BARTHAG, EFG_D, TORD, X3P_O)
bball22 <- read.csv("C:/Users/middi/OneDrive/Documents/MTH 209/Final/Data/cbb22.csv") %>%
select(TEAM, CONF, G, W, ADJOE, ADJDE, BARTHAG, EFGD_D, TORD, X3P_O) %>%
rename(EFG_D = EFGD_D)
bball23 <- read.csv("C:/Users/middi/OneDrive/Documents/MTH 209/Final/Data/cbb23.csv") %>%
select(TEAM, CONF, G, W, ADJOE, ADJDE, BARTHAG, EFG_D, TORD, X3P_O)
bball24 <- read.csv("C:/Users/middi/OneDrive/Documents/MTH 209/Final/Data/cbb24.csv") %>%
select(TEAM, CONF, G, W, ADJOE, ADJDE, BARTHAG, EFGD., TORD, X3P_O) %>%
rename(EFG_D = EFGD.)
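# Stack the 12 season tables into a single data frame
bball <- rbind(bball13,bball14,bball15,bball16,bball17,bball18,bball19,bball20,bball21,bball22,bball23,bball24)
# Standardize the label for independent teams ("ind" -> "Ind")
bball$CONF <- recode(bball$CONF, "ind" = "Ind")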
```
Introduction
===
Column {data-width=400}
---
### Introduction
Many different factors lead to success in NCAA Division 1 men's college basketball, so in this project I analyze some of those factors. The data set is from [Kaggle](https://www.kaggle.com/datasets/andrewsundberg/college-basketball-dataset) and covers each major and minor Division 1 program across multiple categories over 12 seasons (2013-2024).
The questions I want to answer are:
- How does a team's 3 point percentage affect how well they do in the season?
- Do strong defensive stats lead to more success in the season? In other words, does defense win games?
### Variables
Here is a list of the variables used:
- **TEAM** = The Division I college basketball school
- **CONF** = The athletic conference in which the school participates (A10 = Atlantic 10, ACC = Atlantic Coast Conference, AE = America East, Amer = American, ASun = ASUN, B10 = Big Ten, B12 = Big 12, BE = Big East, BSky = Big Sky, BSth = Big South, BW = Big West, CAA = Colonial Athletic Association, CUSA = Conference USA, GWC = Great West Conference, Horz = Horizon League, Ivy = Ivy League, Ind = Independent Team, MAAC = Metro Atlantic Athletic Conference, MAC = Mid-American Conference, MEAC = Mid-Eastern Athletic Conference, MVC = Missouri Valley Conference, MWC = Mountain West, NEC = Northeast Conference, OVC = Ohio Valley Conference, P12 = Pac-12, Pat = Patriot League, SB = Sun Belt, SC = Southern Conference, SEC = Southeastern Conference, Slnd = Southland Conference, Sum = Summit League, SWAC = Southwestern Athletic Conference, WAC = Western Athletic Conference, WCC = West Coast Conference)
- **G** = Games Played
- **W** = Wins
- **ADJOE** = Adjusted Offensive Efficiency: An estimate of the offensive efficiency (points scored per 100 possessions) a team would have against the average D-I defense.
- **ADJDE** = Adjusted Defensive Efficiency: An estimate of the defensive efficiency (points allowed per 100 possessions) a team would have against the average D-I offense.
- **BARTHAG** = Power Rating: Chance of beating an average Division I team
- **EFG_D** = Effective Field Goal Percentage Allowed
- **TORD** = Steal Rate
- **X3P_O** = Three-Point Shooting Percentage
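
To make the efficiency definitions above concrete, here is a minimal sketch of the raw (unadjusted) versions of two of these statistics. The function names and the single-game numbers are hypothetical and are not part of the data set. The adjusted values in ADJOE and ADJDE also account for opponent strength, which this sketch does not attempt to reproduce.

```r
# Raw efficiency: points per 100 possessions (no opponent adjustment).
raw_efficiency <- function(points, possessions) {
  100 * points / possessions
}

# Effective field goal percentage: a made three counts 1.5 times a made two.
effective_fg_pct <- function(fg_made, three_made, fg_attempted) {
  100 * (fg_made + 0.5 * three_made) / fg_attempted
}

# Hypothetical single-game numbers, for illustration only.
raw_efficiency(points = 70, possessions = 68)
effective_fg_pct(fg_made = 25, three_made = 8, fg_attempted = 60)
```
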
Column {data-width=600}
---
### Glimpse of Dataset
```{r glimpse}
datatable(bball[1:500,])
```
Variables
===
Column {.tabset data-width=1000}
---
### Adjusted Defensive Efficiency
```{r vADJDE}
bball %>%
ggplot(aes(x=ADJDE)) + geom_histogram(fill = "darkblue", color = "black") + labs(title = "Distribution of Adjusted Defensive Efficiency of Division 1 Teams", x = "Adjusted Defensive Efficiency")
```
### Effective Field Goal Percentage Allowed
```{r vEFGD}
bball %>%
ggplot(aes(x=EFG_D)) + geom_histogram(fill = "darkgreen", color = "black") + labs(title = "Distribution of Effective Field Goal Percentage Allowed of Division 1 Teams", x = "Effective Field Goal Percentage Allowed")
```
### 3 Point Percentage
```{r 3s}
conf_colors <- rainbow(length(unique(bball$CONF)))
boxplot(X3P_O ~ CONF, data = bball, main = "Distribution of 3 Point Percentage", xlab = "Division 1 Conferences", ylab="3 Point Percentage", cex.axis = 0.7, las = 2, col = conf_colors)
```
---
Conferences and Offensive Success
===
Column {.tabset data-width=500}
---
### Wins by 3 point percentage
```{r 3s and win}
average_wins <- bball %>%
group_by(CONF) %>%
summarize(AverageWins = mean(W, na.rm = TRUE),avg3 = mean(X3P_O, na.rm=TRUE))
ggplot(average_wins, aes(x = avg3, y = AverageWins)) + geom_point(col = conf_colors) + geom_text_repel(aes(label = CONF)) + labs(x = "Average 3 point Percentage per season", y = "Average Wins per season")
```
### Power Rating by 3 point percentage
```{r power rating and 3s}
bball %>%
group_by(CONF) %>%
summarize(avgpow = mean(BARTHAG, na.rm = TRUE), avg3 = mean(X3P_O, na.rm=TRUE)) %>%
ggplot(aes(x = avg3, y = avgpow)) + geom_point(col = conf_colors) + geom_text_repel(aes(label = CONF)) + labs(x = "Average 3 point Percentage per season", y = "Average Power Rating per season")
```
---
Column {data-width=500}
---
### Summary
The first graph relates each conference's average 3 point percentage to its average number of wins, with both averages taken across all seasons in the data set. It shows a weak positive correlation, which suggests that a better 3 point percentage goes along with a higher number of wins. Four conferences are somewhat outliers: the Mid-Eastern Athletic Conference (MEAC), the Southwestern Athletic Conference (SWAC), Independent teams (Ind), and the Great West Conference (GWC). Overall, most conferences average between 33 and 36 percent from 3.
The second graph relates each conference's average 3 point percentage to its average power rating. There is no real correlation between power rating and 3 point percentage. Instead there seem to be 3 clusters. The top cluster contains top conferences like the Big Ten (B10) and Big 12 (B12), which makes sense because they are some of the top schools playing against each other. The middle cluster contains conferences like the Atlantic 10 (A10) and the American (Amer), whose teams play both good and bad opponents. The lower cluster contains weaker teams that mostly play other weaker teams, such as the Ivy League (Ivy). The outliers are the same as in the first graph, except that the Great West Conference (GWC) sits closer to the lower cluster.
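
The strength of a relationship like the ones described above can be checked with a number rather than by eye. The following is a minimal sketch using Pearson's correlation from base R. The helper `conf_cor` and the `toy` table are hypothetical and for illustration only; the values are not results from the data set.

```r
# Pearson correlation between two columns of a conference-level summary table.
conf_cor <- function(df, x, y) {
  cor(df[[x]], df[[y]], use = "complete.obs")
}

# Hypothetical conference averages, for illustration only (not real results).
toy <- data.frame(
  CONF = c("A", "B", "C", "D", "E"),
  avg3 = c(32.5, 33.8, 34.2, 35.1, 35.9),
  AverageWins = c(12.1, 15.4, 14.8, 17.9, 17.2)
)

# Values near 1 or -1 indicate a strong linear relationship; values near 0 a weak one.
conf_cor(toy, "avg3", "AverageWins")

# In the dashboard, the equivalent call would be:
# conf_cor(average_wins, "avg3", "AverageWins")
```
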
---
Conferences and Defensive Success
===
Column {.tabset data-width=500}
---
### Wins and Defensive Efficiency
```{r wins and defense}
winsdef <- bball %>%
group_by(CONF) %>%
summarize(AverageWins = mean(W, na.rm = TRUE),def = mean(ADJDE, na.rm=TRUE))
ggplot(winsdef, aes(x = def, y = AverageWins)) + geom_point(col = conf_colors) + geom_text_repel(aes(label = CONF)) + labs(x = "Average Defensive Efficiency per season", y = "Average Wins per season")
```
### Wins and Shooting defense
```{r wins and def}
winsshot <- bball %>%
group_by(CONF) %>%
summarize(AverageWins = mean(W, na.rm = TRUE),def = mean(EFG_D, na.rm=TRUE))
ggplot(winsshot, aes(x = def, y = AverageWins)) + geom_point(col = conf_colors) + geom_text_repel(aes(label = CONF)) + labs(x = "Average Effective Field Goal Percentage Allowed", y = "Average Wins per season")
```
---
Column {data-width=500}
---
### Summary
The first graph relates each conference's average adjusted defensive efficiency to its average number of wins, with both averages taken across all seasons in the data set. It shows an extremely strong negative correlation: the higher a conference's defensive efficiency (that is, the more points it allows per 100 possessions), the fewer wins per season. There is only one clear outlier, the Great West Conference (GWC).
The second graph relates each conference's average effective field goal percentage allowed to its average number of wins. The correlation is not as strong as in the first graph, but it is still negative, showing that the lower the effective field goal percentage you allow, the more wins you have. Once again the Great West Conference (GWC), Mid-Eastern Athletic Conference (MEAC), Southwestern Athletic Conference (SWAC), and Independent teams (Ind) are all outliers in this graph.
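
One way to put a number on how closely the points follow a line is to fit a simple linear regression and look at its slope and R-squared. This is a minimal sketch, not the method used to draw the graphs. The helper `fit_line` and the `toy_def` table are hypothetical, and the column names `def` and `AverageWins` are assumed to match the `winsdef` summary table.

```r
# Fit AverageWins ~ def and return the slope and R-squared of the fit.
fit_line <- function(df) {
  model <- lm(AverageWins ~ def, data = df)
  c(slope = unname(coef(model)["def"]),
    r_squared = summary(model)$r.squared)
}

# Hypothetical conference averages, for illustration only (not real results).
toy_def <- data.frame(
  def = c(95, 98, 101, 104, 107),
  AverageWins = c(20.5, 18.9, 16.2, 14.8, 12.1)
)

# A negative slope means more points allowed goes with fewer wins;
# an R-squared close to 1 means the points lie close to a straight line.
fit_line(toy_def)

# In the dashboard, the equivalent call would be fit_line(winsdef).
```
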
Outliers
===
Column {.tabset data-width=600}
---
### Outliers and Conference wins by 3 point percentage
```{r outs3}
bballouts <- bball %>%
filter(CONF == "GWC" | CONF == "MEAC" | CONF == "SWAC" | CONF == "Ind") %>%
group_by(TEAM) %>%
summarize(AverageWins = mean(W, na.rm = TRUE),avg3 = mean(X3P_O, na.rm=TRUE))
bballoutsconf <- bball %>%
filter(CONF == "GWC" | CONF == "MEAC" | CONF == "SWAC" | CONF == "Ind") %>%
group_by(CONF) %>%
summarize(AverageWins = mean(W, na.rm = TRUE),avg3 = mean(X3P_O, na.rm=TRUE))
ggplot() + geom_point(data = average_wins, aes(x = avg3, y = AverageWins, col = "Conferences"), shape = 16) + geom_point(data = bballouts, aes(x = avg3, y = AverageWins, col = "Outlier Teams"), shape = 17) + geom_point(data = bballoutsconf, aes(x = avg3, y = AverageWins, col = "Outlier Conferences"), shape = 18) + labs(x = "Average 3 point Percentage per Season", y = "Average Wins per Season", color = "Group") + theme_minimal()
```
### Outliers and Conference Wins and Defensive Efficiency
```{r outsd}
bballoutsd <- bball %>%
filter(CONF == "GWC" | CONF == "MEAC" | CONF == "SWAC" | CONF == "Ind") %>%
group_by(TEAM) %>%
summarize(AverageWins = mean(W, na.rm = TRUE),def = mean(ADJDE, na.rm=TRUE))
bballoutsconfd <- bball %>%
filter(CONF == "GWC" | CONF == "MEAC" | CONF == "SWAC" | CONF == "Ind") %>%
group_by(CONF) %>%
summarize(AverageWins = mean(W, na.rm = TRUE),def = mean(ADJDE, na.rm=TRUE))
ggplot() + geom_point(data = winsdef, aes(x = def, y = AverageWins, col = "Conferences"), shape = 16) + geom_point(data = bballoutsd, aes(x = def, y = AverageWins, col = "Outlier Teams"), shape = 17) + geom_point(data = bballoutsconfd, aes(x = def, y = AverageWins, col = "Outlier Conferences"), shape = 18) + labs(x = "Average Defensive Efficiency per season", y = "Average Wins per Season", color = "Group") + theme_minimal()
```
---
Column {data-width=400}
---
### Summary
The first graph repeats the plot of each conference's average 3 point percentage against its average wins, and adds every team from the 4 outlier conferences to check whether individual outlier teams are driving the conference averages. The teams are clustered around their conference averages: some sit with the majority of conferences, and some do far worse than their conference average. So these conferences are simply weak overall, with a few good teams, a few bad teams, and the majority near the conference average.
The second graph does the same for average defensive efficiency against average wins, again adding every team from the 4 outlier conferences. The outcome is similar to the first graph: these conferences are weak overall, with a few good teams, a few bad teams, and the majority near the conference average.
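
The outlier conferences in this dashboard were picked by eye from the scatterplots. As a rough cross-check, a common rule of thumb is to flag values more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that rule to one column at a time, so it may not agree with a visual judgment made on two variables together. The helper `flag_outliers` and the `toy_conf` table are hypothetical and for illustration only.

```r
# Flag rows whose value lies more than 1.5 * IQR beyond the quartiles.
flag_outliers <- function(df, column) {
  values <- df[[column]]
  q <- quantile(values, c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * (q[[2]] - q[[1]])
  is_out <- values < q[[1]] - fence | values > q[[2]] + fence
  # which() drops missing values instead of returning NA rows
  df[which(is_out), , drop = FALSE]
}

# Hypothetical conference averages, for illustration only (not real results).
toy_conf <- data.frame(
  CONF = c("A", "B", "C", "D", "E", "F", "G"),
  AverageWins = c(15.2, 16.1, 16.8, 17.0, 17.5, 18.3, 9.0)
)

flag_outliers(toy_conf, "AverageWins")

# In the dashboard, the equivalent call would be:
# flag_outliers(average_wins, "AverageWins")
```
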
---
Conclusion/About Author
===
Column {data-width=600}
---
### Conclusion
In conclusion, the graphs suggest that a higher 3 point percentage goes along with more wins, but a higher 3 point percentage doesn't necessarily correlate with a higher power rating. On the defensive side, a lower defensive efficiency (fewer points allowed per 100 possessions) goes along with many more wins, with the conferences forming an almost perfect line. A lower effective field goal percentage allowed also goes along with more wins per season. So the data supports the ideas that a better 3 point percentage leads to more wins and that defense wins games! Finally, the outliers appear to be weaker D1 conferences whose teams are mostly not very good.
### The Why
I chose the D1 basketball data set because I was never really interested in basketball until I came to Dayton, and now I love it, but only college ball. I also love understanding sports by the numbers, so this project let me do both. I chose my questions because 3 pointers can make a huge impact on the game and are hard for most players to shoot, so I wanted to find out whether they really make that big of an impact. Also, the saying that you need defense to win championships is used in almost every sport, and I have been told it from a very young age, so I wanted to find out whether it holds true in college basketball.
Column {data-width=400}
---
### About the Author
My name is Evan McClelland and I am a junior studying Mechanical Engineering at the University of Dayton with a minor in Data Analytics.
Connect with me on [LinkedIn](https://www.linkedin.com/in/evanmcclelland3/)